Hive - A Warehousing Solution Over a Map-Reduce Framework

Authors

  • Ashish Thusoo
  • Joydeep Sen Sarma
  • Namit Jain
  • Zheng Shao
  • Prasad Chakka
  • Suresh Anthony
  • Hao Liu
  • Pete Wyckoff
  • Raghotham Murthy
Abstract

The size of data sets being collected and analyzed in industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop [3] is a popular open-source map-reduce implementation that is being used as an alternative to store and process extremely large data sets on commodity hardware. However, the map-reduce programming model is very low level and requires developers to write custom programs that are hard to maintain and reuse. In this paper, we present Hive, an open-source data warehousing solution built on top of Hadoop. Hive supports queries expressed in a SQL-like declarative language, HiveQL, which are compiled into map-reduce jobs executed on Hadoop. In addition, HiveQL allows custom map-reduce scripts to be plugged into queries. The language includes a type system with support for tables containing primitive types, collections such as arrays and maps, and nested compositions of the same. The underlying IO libraries can be extended to query data in custom formats. Hive also includes a system catalog, the Hive-Metastore, containing schemas and statistics, which is useful in data exploration and query optimization. At Facebook, the Hive warehouse contains several thousand tables with over 700 terabytes of data and is used extensively for both reporting and ad-hoc analyses by more than 100 users. The rest of the paper is organized as follows. Section 2 describes the Hive data model and the HiveQL language with an example. Section 3 describes the Hive system architecture and gives an overview of the query life cycle. Section 4 provides a walk-through of the demonstration. We conclude with future work in Section 5.
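
To make the idea of compiling a declarative query into map-reduce work more concrete, the following is a minimal Python sketch, not Hive's actual compiler or execution engine. It shows how an aggregate such as `SELECT country, COUNT(*) FROM page_views GROUP BY country` could be evaluated as a map phase, a shuffle, and a reduce phase. The table `page_views`, its columns, the sample rows, and all function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical rows of a table page_views(user_id, country); not from the paper.
ROWS = [
    (1, "US"),
    (2, "IN"),
    (3, "US"),
    (4, "BR"),
    (5, "IN"),
    (6, "US"),
]


def map_phase(rows):
    """Emit (group key, 1) for every row, as a mapper for COUNT(*) ... GROUP BY might."""
    for _user_id, country in rows:
        yield country, 1


def shuffle(pairs):
    """Group mapper output by key, mimicking the step between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values of each group to produce the per-key count."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_country(rows):
    """Run the three phases in sequence for the illustrative GROUP BY query."""
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_country(ROWS))
```

In this sketch the query author would only write the one-line declarative statement; the three functions stand in for the kind of hand-written map-reduce code that the abstract describes as hard to maintain and reuse.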

Similar resources

Massive Multi-Omics Microbiome Database (M3DB): A Scalable Data Warehouse and Analytics Platform for Microbiome Datasets

Massive Multi-Omics Microbiome Database (M3DB) is a data warehousing and analytics solution designed to handle diverse, complex, and unprecedented volumes of sequence and taxonomic classification data obtained in a typical microbiome project using NGS technologies. M3DB is a platform developed on Apache Hadoop, Apache Hive and PostgreSQL technologies. It enables users to store, analyze and manage...

A Context-Based Performance Enhancement Algorithm for Columnar Storage in MapReduce with Hive

To achieve high reliability and scalability, most large-scale data warehouse systems have adopted a cluster-based architecture. In this context, MapReduce has emerged as a promising architecture for large-scale data warehousing and data analytics on commodity clusters. The MapReduce framework offers several attractive features such as high fault-tolerance, scalability and use of a variety of ha...

Adaptive Prejoin Approach for Performance Optimization in MapReduce-based Warehouses

MapReduce-based warehousing solutions (e.g. Hive) for big data analytics, capable of storing and analyzing high volumes of both structured and unstructured data in a scalable file system, have emerged recently. Their efficient data loading features enable a so-called near real-time warehousing solution, in contrast to those offered by conventional data warehouses with complex, long-ru...

ARPN Journal of Science and Technology::Analysis of Movie Lens Data Set using Hive

Large-scale data sets provide a better opportunity to discover relationships in data for business intelligence. In this paper, we implement our systems using Hadoop, which has become popular for storing and processing Big Data. However, it is not easy to write Hadoop MapReduce code. Therefore, we use Hive and HiveQL code to understand the relationships between ratings and the users...

A Framework for Semi-Automated Implementation of Multidimensional Data Models

Developing a data warehousing solution is a challenging task that requires considerable resources from enterprises and sustained commitment from the stakeholders. Costs derive mostly from the amount of time invested in the design and physical implementation of these large projects, time that, in our view, may be decreased through the automation of several proces...

Journal:
  • PVLDB

Volume 2

Publication date: 2009